Abstract

In multimedia, text or bioinformatics databases, applications query se- 
quences of n consecutive symbols called n-grams. Estimating the number of 
distinct n-grams is a view-size estimation problem. While view sizes can be 
estimated by sampling under statistical assumptions, we desire an unassum- 
ing algorithm with universally valid accuracy bounds. Most related work has 
focused on repeatedly hashing the data, which is prohibitive for large data 
sources. We prove that a one-pass one-hash algorithm is sufficient for accu- 
rate estimates if the hashing is sufficiently independent. To reduce costs fur- 
ther, we investigate recursive random hashing algorithms and show that they 
are sufficiently independent in practice. We compare our running times with 
exact counts using suffix arrays and show that, while we use hardly any storage,
we are an order of magnitude faster. The approach is further extended to
a one-pass/one-hash computation of n-gram entropy and iceberg counts. The
experiments use a large collection of English text from the Gutenberg Project 
as well as synthetic data. 



1 Introduction 

Consider a sequence of symbols $a_i \in \Sigma$ of length N. Perhaps the data source has
high latency, for example, it is not in a flat binary format or in a DBMS, making 

*This is an expanded version of [LK06].
random access and skipping impractical. The symbols need not be characters from
a natural language: they can be particular "events" inferred from a sensor or a news 
feed, they can be financial or biomedical patterns found in time series, they can be 
words in a natural language, and so on. While small compared to the amount of 
memory available, the number of distinct symbols ($|\Sigma|$) could be large: on the
order of $10^5$ in the case of words in a typical English dictionary or $10^7$ in the case
of the Google 5-gram data set [FB06]. We make no other assumption about the 
distribution of these distinct symbols. 

An n-gram is a consecutive sequence of n symbols. Given a data source
containing N symbols, there are up to $N - n + 1$ distinct n-grams. We use n-grams in
language modeling [GZ01], pattern recognition [YTH90], predicting web page
accesses [DK04], information retrieval [NGZZ00], text categorization and author
attribution [CMS01, LTSB06, KC04], speech recognition [Jel98], multimedia [PK03],
music retrieval [DR03], text mining [LOK00], information theory [Sha48],
software fault diagnosis [BS05], data compression [BH84], data mining [SYLZ00],
indexing [KWLL05], On-line Analytical Processing (OLAP) [KKL05a], optical
character recognition (OCR) [Dro03], automated translation [LH03], time series
segmentation [CHA02], and so on. This paper concerns the use of previously
published hash functions for n-grams, together with recent randomized algorithms
for estimating the number of distinct items in a stream of data. Together, they
permit memory-efficient estimation of the number of distinct n-grams.

The number of distinct n-grams grows large with n: Google makes available
$1.1 \times 10^9$ word 5-grams, each occurring more than 400 times in $10^{12}$ words of
text [FB06]. On a smaller scale, storing Shakespeare's First Folio [Pro06] takes
up about 4.6 MiB but we can verify that it has over 3 million distinct 15-grams of
characters. If each distinct n-gram can be stored using $\log(4.6 \times 10^6) \approx 22$ bits,
then we need about 8.4 MiB just to store the n-grams ($3 \times 10^6 \times 22/8 \approx 8.3 \times 10^6$)
without counting the indexing overhead. Thus, storing and indexing n-grams can
use up more storage than the original data source. Extrapolating this to the large
corpora used in computational linguistic studies, we see the futility of using
brute-force approaches that store the n-grams in main memory, when n is large. For
smaller values of n, n-grams of English words are also likely to defeat brute-force
approaches.

Even if storage is not an issue, indexing a very large number m of distinct
n-grams is computationally expensive because of the overhead associated with
indexing data structures. In practice, we have a bound on processing time or
storage and we may need to focus on selected n-grams such as n-grams occurring
more than once or less than 10 times (iceberg n-grams) depending on the
estimates. Moreover, when only a count or an entropy estimate is required for cost
optimizers or user feedback, materializing all n-grams is unnecessary. Finally,
determining the number of infrequent n-grams is important for building "next word"
indexes [BWZ02, CP06] since inverted indexes [ZMR98] are most efficient over
rare words or phrases. Therefore, estimating the n-gram statistics online and
quickly, in a single pass, is an important problem.

There are two strategies for estimation of statistics of a sequence in a single
pass [KMR+94, BDKR02, GMV06]. The generative (or black-box) strategy samples
values at random. From the samples, the probability of each value is
estimated by maximum likelihood or other statistical techniques. The evaluative
strategy, on the other hand, probes the exact probabilities or, equivalently, the number of
occurrences of (possibly randomly) chosen values. In one pass, we can randomly
probe several n-grams so we know their exact frequency.

On the one hand, it is difficult to estimate the number of distinct elements from
a sampling, without making further assumptions. For example, suppose there is
only one distinct n-gram in 100 samples out of 100,000 n-grams. Should we
conclude that there is only one distinct n-gram overall? Perhaps there are 100 distinct
n-grams, but 99 of them only occur once, thus there is a $\approx 91\%$ probability that we
observe only the common one. While this example is a bit extreme, skewed
distributions are quite common as the Zipf law shows. Choosing, a priori, the number
of samples we require is a major difficulty. Estimating the probabilities from
sampling is a problem that still interests researchers to this day [MS00, OSZ03].
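
The probability quoted in this example can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the 100 samples are drawn uniformly and with replacement (the text does not say which sampling scheme is meant); the function name is hypothetical.

```python
def prob_only_common_seen(total: int, rare: int, samples: int) -> float:
    """Probability that `samples` uniform draws (with replacement, an
    assumption of this sketch) from `total` n-gram occurrences never hit
    any of the `rare` n-grams that occur exactly once."""
    return (1.0 - rare / total) ** samples


# 100,000 occurrences, 99 singleton n-grams, 100 samples.
p = prob_only_common_seen(total=100_000, rare=99, samples=100)
print(f"{p:.2f}")
```

The result is close to 0.91, so a sample of this size would usually show a single distinct n-gram even though 100 exist.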

On the other hand, distinct count estimates from a probing are statistically
easier [GT01]. With the example above, with just enough storage budget to store
100 distinct n-grams, we would get an exact count estimate! On the downside,
probing requires properly randomized hashing.

In the spirit of probing, Gibbons-Tirthapura (GT) [GT01] count estimation
goes as follows. We have m distinct items in a stream containing the distinct items
$x_1, \ldots, x_m$ with possible repetitions. Let $h(x_i)$ be pairwise independent hash
values over $[0, 2^L)$ and let $h_t(x_i)$ be the first t bits of the hash value. We have that
$E(\mathrm{card}(\{h_t^{-1}(0)\})) = m/2^t$. Given a fixed memory budget M, and setting t = 0,
as we scan, we store all distinct items $x_i$ such that $h_t(x_i) = 0$ in a look-up table H.
As soon as size(H) = M + 1, we increment t by 1 and remove all $x_i$ in H such that
$h_t(x_i) \ne 0$. Typically, at least one element in H is removed, but if not, the process
of incrementing t and removing items is repeated until size(H) $\le$ M. Then we
continue scanning. After the run is completed, we return size(H) $\times 2^t$ as the
estimate. By choosing $M = 576/\epsilon^2$ [BYJK+02], we achieve an accuracy of $\epsilon$, 5 times
out of 6 ($P(|\mathrm{size}(H) \times 2^t - m| > \epsilon m) < 1/6$), by an application of Chebyshev's
inequality. By Chernoff's bound, running the algorithm $O(\log 1/\delta)$ times and taking
the median of the results gives a reliability of $1 - \delta$ instead of 5/6. Bar-Yossef et al.
suggest improving the algorithm by storing hash values of the $x_i$'s instead of the
$x_i$'s themselves, reducing the reliability but lowering the memory usage. Notice
that a corollary we prove later shows that the estimate of a 5/6 reliability for $M = 576/\epsilon^2$
is pessimistic: $M = 576/\epsilon^2$ implies a reliability of over 99%. We also prove that
replacing pairwise independent by 4-wise independent hashing substantially
improves the existing theoretical performance bounds.¹
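
The GT procedure can be sketched in a few lines. This is an illustrative sketch only, not the implementation evaluated in this paper: the function name and parameters are hypothetical, a keyed BLAKE2b digest stands in for the pairwise independent family (an assumption made for reproducibility), and "the first t bits" are taken to be the t lowest-order bits.

```python
import hashlib


def gt_estimate(stream, memory_budget, seed=0):
    """One-pass distinct-count estimate in the style of Gibbons-Tirthapura."""
    key = seed.to_bytes(8, "little")

    def h(item):
        # Stand-in hash (assumption): 64-bit keyed BLAKE2b digest of repr(item).
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8, key=key)
        return int.from_bytes(digest.digest(), "little")

    t = 0        # number of low-order hash bits that must be zero
    table = {}   # item -> hash value, for items with h_t(item) == 0
    for item in stream:
        hv = h(item)
        if hv & ((1 << t) - 1):
            continue  # h_t(item) != 0: the item is not tracked at level t
        table[item] = hv
        # On overflow, raise t and evict items until the table fits again.
        while len(table) > memory_budget and t < 64:
            t += 1
            low_bits = (1 << t) - 1
            table = {k: v for k, v in table.items() if not v & low_bits}
    return len(table) * (1 << t)


# With fewer distinct items than the budget, the count is exact.
print(gt_estimate(list(range(50)) * 3, memory_budget=100))
```

When the number of distinct items exceeds the budget, the returned value is only an estimate whose accuracy depends on M, as discussed above.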

Random hashing can be the real bottleneck in probing, but to alleviate this
problem for n-gram hashing, we use recursive hashing [Coh97, KR87]: we leverage
the fact that successive n-grams have n - 1 characters in common. We study
empirically online n-gram statistical estimations (counts, iceberg counts and
entropy) in one pass that hashes each n-gram only once. We compare several
different recursive n-gram hashing algorithms including hashing by cyclic and
irreducible polynomials in the binary Galois Field (GF(2)[x]). The main contributions
of this paper are a tighter theoretical bound in count estimation and an
experimental validation to demonstrate practical usefulness. This work has a wide range of
applications, from other view-size estimation problems to text mining.

2 Related Work 

Related work includes reservoir sampling, suffix arrays, and view-size estimation 
in OLAP. 

2.1 Reservoir Sampling 

We can choose randomly, without replacement, k samples in a sequence of
unknown length using a single pass through the data by reservoir sampling. Reservoir
sampling [Vit85, KW06, Li94] was introduced by Knuth [Knu69]. All reservoir
sampling algorithms begin by appending the first k samples to an array. In their
linear time ($O(N)$) form, reservoir sampling algorithms sequentially visit every
symbol choosing it as a possible sample with probability k/t where t is the number
of symbols read so far. The chosen sample is simply appended at the end of the
array while an existing sample is flagged as having been removed. The array has
an average size of $k(1 + \log(N/k))$ samples at the end of the run. In their sublinear
form ($O(k(1 + \log(N/k)))$ expected time), the algorithms skip a random number of
data points each time. While these algorithms use a single pass, they assume that
the number of required samples k is known a priori, but this is difficult without any
knowledge of the data distribution.
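
For illustration, the linear-time form can be sketched as follows. This is a minimal sketch of the common in-place variant, which overwrites a random existing sample instead of appending the new one and flagging an old one as removed; the function name is hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly without replacement in one pass."""
    rng = rng or random.Random()
    sample = []
    for t, symbol in enumerate(stream, start=1):
        if t <= k:
            sample.append(symbol)          # the first k symbols fill the array
        elif rng.random() < k / t:         # keep the t-th symbol with prob. k/t
            sample[rng.randrange(k)] = symbol  # it replaces a random old sample
    return sample


print(len(reservoir_sample(range(1000), 10, random.Random(1))))
```

Note that k is an input: the sketch shares the limitation discussed above, since the number of samples must be fixed before the data is seen.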

1 The application of p-wise independent hash functions for estimating frequency moments is well
known [BGKS06, IW05].



2.2 Suffix Arrays 



Using suffix arrays [MM90, MM93] and the length of the maximal common prefix
between successive suffixes, Nagao and Mori [NM94] proposed a fast algorithm
to compute n-gram statistics exactly. However, it cannot be considered an online
algorithm even if we compute the suffix array in one pass: after constructing the
suffix array, one must go through all suffixes at least once more. Their
implementation was later improved by Kit and Wilks [KW98]. Unlike suffix trees [GKS99],
uncompressed suffix arrays do not require several times the storage of the original
document and their performance does not depend on the size of the alphabet.
Suffix arrays can be constructed in $O(N)$ time using $O(N)$ working space [HSS03].
Querying a suffix array for a given n-gram takes $O(\log N)$ time.

2.3 View-Size Estimation in OLAP 

By definition, each n-gram is a tuple of length n and can be viewed as a relation
to be aggregated. OLAP (On-Line Analytical Processing) [Cod93] is a database
acceleration technique used for deductive analysis, typically involving
aggregation. To achieve acceleration, one frequently builds data cubes [GBLP96] where
multidimensional relations are pre-aggregated in multidimensional arrays. OLAP
is commonly used for business purposes with dimensions such as time, location,
sales, expenses, and so on. Concerning text, most work has focused on
informetrics/bibliomining, document management and information retrieval
[MLC+00, MCDA03, NHJ03, Ber95, Sul01]. The idea of using OLAP for exploring the text
content itself (including phrases and n-grams) was proposed for the first time by
Keith, Kaser and Lemire [KKL05b, KKL05a]. The estimation of n-gram counts
can be viewed as an OLAP view-size estimation problem which itself "remains an
important area of open research" [DERC06]. A data-agnostic approach to
view-size estimation [SDNR96], which is likely to be used by database vendors, can
be computed almost instantly as long as we know how many attributes each
dimension has and the number of relations $\eta$. For n-gram estimation, the number
of attributes is the size of the alphabet $|\Sigma|$ and $\eta$ is the number of n-grams with
possible repetitions ($\eta = N - n + 1$).

Given $\eta$ cells picked uniformly at random, with replacement, in a
$V = K_1 \times K_2 \times \cdots \times K_n$ space, the probability that any given cell (think "n-gram") is omitted is
$(1 - \frac{1}{V})^{\eta}$. For n-grams, $V = |\Sigma|^n$. Therefore, the expected number of unoccupied
cells is $(1 - \frac{1}{V})^{\eta} \times V$.
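
Under this uniform model, the expected number of distinct n-grams is V minus the expected number of unoccupied cells. The following is a minimal sketch of that computation; the function name is hypothetical, and `expm1`/`log1p` are used only to keep the arithmetic stable when V is astronomically large.

```python
import math


def expected_distinct_uniform(eta: int, V: float) -> float:
    """Expected number of occupied cells after eta uniform draws, with
    replacement, among V cells: V * (1 - (1 - 1/V)**eta)."""
    if V <= 1:
        return float(min(eta, 1))  # degenerate single-cell space
    return -V * math.expm1(eta * math.log1p(-1.0 / V))


# One million 15-grams over a 27-symbol alphabet (illustrative values).
print(expected_distinct_uniform(10**6, 27.0**15))
```

With such a large V the model predicts that almost every n-gram is distinct, which illustrates why the data-agnostic estimate tends to overestimate on real text.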

Similarly, assuming the number of n-grams is known to be m, the same model
permits us to estimate the number of (n - 1)-grams by $m \times (1 - \ldots)$. In practice,
these approaches overestimate systematically because relations are not uniformly
distributed.

A more sophisticated view-size estimation algorithm used in the context of
data warehousing and OLAP [SDNR96, Kot02] is logarithmic probabilistic
counting [FM85]. This approach requires a single pass and almost no memory, but
it assumes independent hashing for which no algorithm using limited storage is
known [BYJK+02]. Practical results are sometimes disappointing [DERC06],
possibly because many random hash values need to be computed for each data point.
Other variants of this approach include linear probabilistic counting [WVZT90,
SRR04] and loglog counting [DF03].

View-size estimation through sampling has been made adaptive by Haas et
al. [HNSS95]: their strategy is to first attempt to determine whether the distribution
is skewed and then use an appropriate statistical estimator. We can also count the
marginal frequencies of each attribute value (or symbol in an n-gram setting) and
use them to give estimates as well as (exact) lower and upper bounds on the view
size [YZS05]. Other researchers make particular assumptions on the distribution
of relations [NT03, CGR01, CGR03, FMS96].

3 Multidimensional Random Hashing 

Hashing encodes an object as a fixed-length bit string for comparison. Multidimen- 
sional hashing is a particular form of hashing where the objects can be represented 
as tuples. Multidimensional hashing is of general interest since several commonly 
occurring objects can be thought of as tuples: 32-bit values can be seen as 8-tuples 
containing 4-bit values. 

For convenience, we consider hash functions mapping keys to $[0, 2^L)$, where
the set U of possible keys is much larger than $2^L$. A difficulty with hashing is that
any particular hash function h has some "bad inputs" $S \subset U$ over which some hash
value (such as 0) is either too frequent or not frequent enough ($\mathrm{card}(h^{-1}(0)) \not\approx
\mathrm{card}(S)/2^L$) making count estimates from hash values difficult. Rather than make
assumptions about the probabilities of bad inputs for a particular fixed hash
function, an alternative approach [CW79] selects the hash function randomly from
some family $\mathcal{H}$ of functions, all of which map U to $[0, 2^L)$.

Clearly, some families $\mathcal{H}$ have desirable properties that other families do not
have. For instance, consider a family whose members always map to even
numbers; then, considering the random possible selections of h from $\mathcal{H}$, for any
$x \in U$ we have $P(h(x) = i) = 0$ for any odd i. This would be an undesirable
property for many applications. We now mention some desirable properties of
families. $\mathcal{H}$ is uniform if, considering h selected uniformly at random from $\mathcal{H}$
and for all x and y, we have $P(h(x) = y) = 1/2^L$. This condition is too weak;
the family of constant functions is uniform but would be disastrous when used
with the GT algorithm. We need stronger conditions implying that any particular
member h of the family must hash objects evenly over $[0, 2^L)$. $\mathcal{H}$ is pairwise
independent or universal [CW79] if for all $x_1$, $x_2$, y, z with $x_1 \ne x_2$, we have that
$P(h(x_1) = y \land h(x_2) = z) = P(h(x_1) = y) P(h(x_2) = z) = 1/4^L$. We will refer to
such an $h \in \mathcal{H}$ as a "pairwise independent hash function" when the family in
question can be inferred from the context or is not important. Pairwise independence
implies uniformity.

Gibbons and Tirthapura showed that pairwise independence was sufficient to
approximate count statistics [GT01] essentially because the variance of the sum of
pairwise independent variables is just the sum of the variances ($\mathrm{Var}(X_1 + \ldots + X_j) =
\mathrm{Var}(X_1) + \ldots + \mathrm{Var}(X_j)$). A well-known example of a pairwise-independent hash
function for keys in the range $[0, B^{r+1})$, where B is prime, is computed as follows.
Express key x as $x_r x_{r-1} \ldots x_0$ in base B. Randomly choose a number $a \in [0, B^{r+1})$
and express it as $a_r a_{r-1} \ldots a_0$ in base B. Then, set $h(x) = \sum_{i=0}^{r} a_i x_i \pmod{B}$. The
proof that it is pairwise independent follows from the fact that integers modulo a
prime number form a field (GF(B)).
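
The construction above is short enough to sketch directly. The code below is illustrative only: the function names are hypothetical, and the exhaustive enumeration is practical only for tiny B and r. It counts, over all choices of the random number a, how often two fixed distinct keys collide.

```python
from itertools import product


def dot_hash(x, coeffs, B):
    """h(x) = sum_i a_i * x_i mod B, where x_i are the base-B digits of x
    and coeffs = (a_0, ..., a_r) are the base-B digits of the random a."""
    total = 0
    for a_i in coeffs:
        x, digit = divmod(x, B)  # peel off the next base-B digit x_i
        total += a_i * digit
    return total % B


def collision_count(x, y, B, r):
    """Number of coefficient vectors (a_0, ..., a_r) for which x and y collide."""
    return sum(dot_hash(x, c, B) == dot_hash(y, c, B)
               for c in product(range(B), repeat=r + 1))


# Keys 3 and 17 in base 5 with two digits: 5 of the 25 choices of a collide,
# that is, a fraction 1/B of them.
print(collision_count(3, 17, B=5, r=1))
```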

Moreover, the idea of pairwise independence can be generalized: a family of
hash functions $\mathcal{H}$ is k-wise independent if given distinct $x_1, \ldots, x_k$ and given h
selected uniformly at random from $\mathcal{H}$, then
$P(h(x_1) = y_1 \land \cdots \land h(x_k) = y_k) = 1/2^{kL}$.
Note that k-wise independence implies (k - 1)-wise independence and uniformity.
(Fully) independent hash functions are k-wise independent for arbitrarily large k.
Siegel [Sie89] has shown that k-wise independent hash functions between integer
ranges whose sizes are powers of two
can be evaluated in $O(1)$ time in a unit-cost RAM model. Whether his approach
can be efficiently implemented is unclear, and in any case his results are not directly
applicable to us. For instance, we assume a cost of $\Omega(n)$ to process an n-gram.²
This paper contributes better bounds for approximate count statistics, provided
that more independent hash functions are used (4-wise instead of pairwise,
for instance).

In the context of n-gram hashing, we seek recursive families of hash functions
so that we can compute new hash values quickly by reusing previous hash values.
A hash function h is recursive if there exists a (fixed) function F over triples such
that

$$h(x_2, x_3, \ldots, x_{n+1}) = F(h(x_1, x_2, \ldots, x_n), x_1, x_{n+1}).$$

The function F must be independent of h. (F is common to all members of the
family.) By extension³, a hash function h is recursive over hash values $\tau(x_i)$, where
$\tau$ is a randomized hash function for symbols, if there is a function F, independent
of $\tau$ and h, such that

$$h(x_2, \ldots, x_{n+1}) = F(h(x_1, \ldots, x_n), \tau(x_1), \tau(x_{n+1})).$$

2 We do assume O(1) time to access the first or last symbol in an n-gram and O(1) expected time
to find such a symbol in a look-up table. As well, we also implicitly make unit-cost assumptions
when calculating the L-bit hash values.

3 This extended sense resembles Cohen's use of "recursive."

Similarly, a hash function h is semi-recursive if there exists a (fixed) function
G over pairs such that

$$h(x_1, x_2, \ldots, x_{n+1}) = G(h(x_1, x_2, \ldots, x_n), x_{n+1}).$$

This relates hashing an (n + 1)-gram to hashing an overlapping n-gram, whereas
recursive hashing involves two overlapping n-grams. Hence, we cannot say that
recursive hash functions are semi-recursive.

As an example of a recursive hash function, given tuples $(x_1, x_2, \ldots, x_n)$ whose
components are integers taken from $[0, B)$, we can hash by the Karp-Rabin
formula $\sum_i x_i B^{i-1} \bmod R$, where R is some prime defining the range of the hash
function [KR87, GBY90]. This is semi-recursive, with $G(v, x_{n+1}) = (v + x_{n+1} B^n) \bmod R$,
and also recursive. Regardless, it is a poor hash function for us, since n-grams
with common suffixes all get very similar hash values. For probabilistic-counting
approaches based on the number of trailing zeros of the hashed value, if $h(x_1, x_2, \ldots, x_n)$
has many trailing zeros, then we know that $h(x_1, x_2, \ldots, x_{n-1}, x'_n)$ has few trailing
zeros (assuming $x_n \ne x'_n$).

In fact, no recursive hash function can be pairwise independent.

Proposition 1 Recursive hash functions are not pairwise independent.

Proof: Suppose h is a recursive pairwise independent hash function. Let $h_{i,j} =
h(x_i, \ldots, x_j)$. Then $F(h_{1,n}, x_1, x_{n+1}) = h_{2,n+1}$ where the function F must be
independent of h.

Fix the values $x_1, \ldots, x_{n+1}$, then

$$P(h_{1,n} = y \land h_{2,n+1} = z) = P(h_{1,n} = y \land F(y, x_1, x_{n+1}) = z) = 0 \ne \frac{1}{4^L} \quad \text{whenever } z \ne F(y, x_1, x_{n+1}),$$

a contradiction. □

As the next lemma and proposition show, being recursive over hashed values,
while a weaker requirement, does not allow more than pairwise independence.



Lemma 1 Let F be the recursive function of any recursive uniform hash function,
then, for v, w fixed, $\lambda x.F(x, v, w)$ is one-to-one.

Proof: We need to show that $F(x, v, w) = y$ and $F(x', v, w) = y$ implies $x = x'$.
Consider a sequence $v, x_2, \ldots, x_n, w$ and any uniform hash function h, then

$$\frac{1}{2^L} = P(h(x_2, \ldots, x_n, w) = y) = P(F(h(v, x_2, \ldots, x_n), v, w) = y)$$
$$= \sum_{\eta \in \{z \mid F(z, v, w) = y\}} P(h(v, x_2, \ldots, x_n) = \eta) = \sum_{\eta \in \{z \mid F(z, v, w) = y\}} \frac{1}{2^L}$$
$$= \frac{\mathrm{card}(\{z \mid F(z, v, w) = y\})}{2^L},$$

and hence $\mathrm{card}(\{z \mid F(z, v, w) = y\}) = 1$, showing the result. □



Proposition 2 Recursive hashing functions over hashed values cannot be 3-wise
independent.

Proof: Consider the string of symbols $a^n bb$, recalling that a and b are arbitrary
but distinct members of $\Sigma$.
We have that

$$P(h(a, \ldots, a) = x, h(a, \ldots, a, b) = y, h(a, \ldots, a, b, b) = y)$$
$$= P(h(a, \ldots, a) = x, F(x, \tau(a), \tau(b)) = y, F(y, \tau(a), \tau(b)) = y).$$

However, we can only have $F(x, \tau(a), \tau(b)) = y$ and $F(y, \tau(a), \tau(b)) = y$ if x = y
by Lemma 1 and so, the above probability is zero unless x = y, preventing 3-wise
independence. □

A trivial way to generate an independent hash is to assign a random integer
in $[0, 2^L)$ to each new value x. Unfortunately, this requires as much processing
and storage as a complete indexing of all values. However, in a multidimensional
setting this approach can be put to good use. Suppose that we have tuples in
$K_1 \times K_2 \times \cdots \times K_n$ such that $|K_i|$ is small for all i. We can construct
independent hash functions $h_i : K_i \to [0, 2^L)$ for all i and combine them. For example, the
hash function $h(x_1, x_2, \ldots, x_n) = h_1(x_1) \oplus h_2(x_2) \oplus \cdots \oplus h_n(x_n)$ is n-wise
independent ($\oplus$ is the exclusive or). As long as the sets $K_i$ are small, in time $O(\sum_i |K_i|)$ we
can construct the hash function by generating $\sum_i |K_i|$ random numbers and storing
them in a look-up table. With constant-time look-up, hashing an n-gram thus takes
$O(Ln)$ time, or $O(n)$ if L is considered a constant.
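
The table-based construction can be sketched as follows. This is a minimal sketch with hypothetical names; because the alphabet is not assumed to be known in advance, each table is a dictionary filled lazily with random L-bit values the first time a symbol is seen at a given position.

```python
import random


class XorTableHash:
    """h(x_1, ..., x_n) = h_1(x_1) xor ... xor h_n(x_n), one table per position."""

    def __init__(self, n, bits=32, seed=None):
        self.n = n
        self.bits = bits
        self._rng = random.Random(seed)
        self._tables = [dict() for _ in range(n)]  # table i implements h_{i+1}

    def lookup(self, position, symbol):
        table = self._tables[position]
        if symbol not in table:
            table[symbol] = self._rng.getrandbits(self.bits)
        return table[symbol]

    def hash(self, ngram):
        if len(ngram) != self.n:
            raise ValueError("expected an n-gram of length %d" % self.n)
        value = 0
        for position, symbol in enumerate(ngram):
            value ^= self.lookup(position, symbol)  # n look-ups per n-gram
        return value


hasher = XorTableHash(n=3, seed=1)
print(hasher.hash("abc") == hasher.hash("abc"))
```

Each call performs n table look-ups, which is the cost that the recursive schemes discussed next try to avoid.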



Unfortunately, this hash function is not recursive but it is semi-recursive. In
the n-gram context, we can choose $h_1 = h_2 = \ldots$ since $\Sigma = K_1 = K_2 = \ldots$ While
the resulting hash function is recursive over hashed values since

$$h(x_2, \ldots, x_{n+1}) = h_1(x_2) \oplus \cdots \oplus h_1(x_{n+1})$$
$$= h_1(x_1) \oplus h_1(x_{n+1}) \oplus h(x_1, \ldots, x_n),$$

it is no longer even pairwise independent, since $P(h(a, b) = w, h(b, a) = v) = 0$ if
$w \ne v$.

To obtain some of the speed benefits of recursive hashing with p > 2-wise
independence, a hybrid approach might be useful when n is large. It splits the
n-gram into p pieces, each of which can be updated. For simplicity, suppose n is a
multiple of p and n > 2p. We use

$$h(x_1, x_2, \ldots, x_n) = h_1(x_1) \oplus h_1(x_2) \oplus \ldots \oplus h_1(x_{n/p})$$
$$\oplus\, h_2(x_{(n/p)+1}) \oplus \cdots \oplus h_2(x_{2n/p})$$
$$\oplus \ldots$$
$$\oplus\, h_p(x_{((p-1)n/p)+1}) \oplus \cdots \oplus h_p(x_n).$$

To update, note that

$$h(x_2, \ldots, x_{n+1}) = h(x_1, \ldots, x_n) \oplus h_1(x_1)$$
$$\oplus\, (h_1(x_{n/p+1}) \oplus h_2(x_{n/p+1}))$$
$$\oplus \ldots$$
$$\oplus\, (h_{p-1}(x_{((p-1)n/p)+1}) \oplus h_p(x_{((p-1)n/p)+1}))$$
$$\oplus\, h_p(x_{n+1}).$$

Thus, we have a p-wise independent recursive hash function.

For n-gram estimation, we seek families of hash functions that behave, for
practical purposes, like n-wise independent ones while being recursive over hash values.
A particularly interesting form of hashing using the binary Galois field GF(2) is
called "Recursive Hashing by Polynomials" and has been attributed to Kubina by
Cohen [Coh97]. GF(2) contains only two values (1 and 0) with the addition (and
hence subtraction) defined by "exclusive or", $a + b = a \oplus b$, and the multiplication
by "and", $a \times b = a \land b$. GF(2)[x] is the vector space of all polynomials with
coefficients from GF(2). Any integer in binary form (e.g. c = 1101) can thus be
interpreted as an element of GF(2)[x] (e.g. $c = x^3 + x^2 + 1$). If $p(x) \in$ GF(2)[x],
then GF(2)[x]/p(x) can be thought of as GF(2)[x] modulo p(x). As an example,
if $p(x) = x^2$, then GF(2)[x]/p(x) is the set of all linear polynomials. For instance,
$x^3 + x^2 + x + 1 = x + 1 \pmod{x^2}$ since, in GF(2)[x], $(x + 1) + x^2(x + 1) = x^3 +
x^2 + x + 1$.

Interpreting $h_1$ hash values as polynomials in GF(2)[x]/p(x), and with the
condition that degree(p(x)) > n, we define $h(a_1, a_2, \ldots, a_n) = h_1(a_1)x^{n-1} + h_1(a_2)x^{n-2} +
\ldots + h_1(a_n)$. It is recursive over the sequence $h_1(a_i)$. The combined hash can be
computed in constant time with respect to n by reusing previous hash values:

$$h(a_2, a_3, \ldots, a_{n+1}) = x\,h(a_1, a_2, \ldots, a_n) - h_1(a_1)x^n + h_1(a_{n+1}).$$

Choosing $p(x) = x^L + 1$ for L > n, for any polynomial $q(x) = \sum_{i=0}^{L-1} q_i x^i$, we
have

$$x\,q(x) = x(q_{L-1}x^{L-1} + \ldots + q_1 x + q_0) = q_{L-2}x^{L-1} + \cdots + q_0 x + q_{L-1}.$$

Thus, we have that multiplication by x is a cyclic left shift. The resulting hash is
called Recursive Hashing by Cyclic Polynomials [Coh97], or (for short)
CYCLIC. It was shown empirically to be uniform [Coh97], but it is not formally
so:

Lemma 2 CYCLIC is not uniform for n even and never pairwise independent.

Proof: If n is even, $x^{n-1} + \ldots + x + 1$ is divisible by $x + 1$, so $x^{n-1} + \ldots + x + 1 =
(x + 1)r(x)$ for some polynomial r(x). Clearly, $r(x)(x + 1)(x^{L-1} + x^{L-2} + \ldots +
x + 1) = 0 \pmod{x^L + 1}$ for any r(x) and so $P(h(a^n) = 0) = P((x^{n-1} + \ldots + x +
1)h_1(a) = 0) = P((x + 1)r(x)h_1(a) = 0) \ge P(h_1(a) = 0 \lor h_1(a) = r^{-1}(x)(x^{L-1} +
x^{L-2} + \ldots + x + 1)) = 1/2^{L-1}$. Therefore, CYCLIC is not uniform for n even.

To show CYCLIC is never pairwise independent, consider n = 3 (for
simplicity), then $P(h(aab) = h(aba)) = P((x + 1)(h_1(a) + h_1(b)) = 0) \ge P(h_1(a) +
h_1(b) = 0 \lor h_1(a) + h_1(b) = x^{L-1} + x^{L-2} + \ldots + x + 1) = 1/2^{L-1}$, but pairwise
independent hash values are equal with probability $1/2^L$. The result is shown. □
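
The CYCLIC update rule described before Lemma 2 can be sketched as follows. This is an illustrative sketch rather than the implementation benchmarked in this paper: the class and method names are hypothetical, the symbol hash $h_1$ is a lazily filled table of random L-bit values, and subtraction in GF(2) is an exclusive or.

```python
import random


class CyclicHash:
    """Rolling n-gram hash where multiplication by x is a cyclic left shift."""

    def __init__(self, n, bits=32, seed=None):
        if bits <= n:
            raise ValueError("this sketch assumes L > n")
        self.n, self.bits = n, bits
        self.mask = (1 << bits) - 1
        self._rng = random.Random(seed)
        self._h1 = {}  # symbol -> random L-bit value, filled lazily

    def h1(self, symbol):
        if symbol not in self._h1:
            self._h1[symbol] = self._rng.getrandbits(self.bits)
        return self._h1[symbol]

    def _rotl(self, value, r):
        # Cyclic left shift by r positions, with 0 < r < bits.
        return ((value << r) | (value >> (self.bits - r))) & self.mask

    def direct(self, ngram):
        # Horner evaluation of h_1(a_1) x^(n-1) + ... + h_1(a_n).
        value = 0
        for symbol in ngram:
            value = self._rotl(value, 1) ^ self.h1(symbol)
        return value

    def update(self, value, outgoing, incoming):
        # x * h - h_1(outgoing) * x^n + h_1(incoming), all in GF(2)[x]/(x^L + 1).
        return (self._rotl(value, 1)
                ^ self._rotl(self.h1(outgoing), self.n)
                ^ self.h1(incoming))


text, n = "abracadabra", 4
hasher = CyclicHash(n, seed=42)
value = hasher.direct(text[:n])
for i in range(1, len(text) - n + 1):
    value = hasher.update(value, text[i - 1], text[i + n - 1])
    assert value == hasher.direct(text[i:i + n])  # rolling matches direct
```

Each update costs two table look-ups and a few shifts regardless of n, in contrast with the n look-ups of the table-based n-wise independent hash.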

In contrast to CYCLIC, to generate hash functions over $[0, 2^L)$ we can choose
p(x) to be an irreducible polynomial of degree L in GF(2)[x]. For L = 19, an
example choice is $p(x) = 1 + x^2 + x^3 + x^5 + x^6 + x^7 + x^{12} + x^{16} + x^{17} + x^{18} + x^{19}$ [Rus06].
(With this particular irreducible polynomial, L = 19 and so we require n < 19.
Irreducible polynomials of larger degree can be found [Rus06] if desired.)
Computing $(a_{18}x^{18} + \ldots + a_0)x \pmod{p(x)}$ as a polynomial of degree 18 or less, for
representation in 19 bits, can be done efficiently. We have $(a_{18}x^{18} + \ldots + a_0)x =
a_{18}(p(x) - x^{19}) + a_{17}x^{18} + \ldots + a_0 x \pmod{p(x)}$ and the polynomial on the
right-hand side is of degree at most 18. In practice, we do a left shift of the first 18 bits
of the hash value and if the value of the 19th bit is 1, then we apply an exclusive or
with the integer $1 + 2^2 + 2^3 + 2^5 + 2^6 + 2^7 + 2^{12} + 2^{16} + 2^{17} + 2^{18} + 2^{19}$. The
resulting hash is called Recursive Hashing by General Polynomials [Coh97],
or (for short) General. The main benefit of setting p(x) to be an irreducible
polynomial is that GF(2)[x]/p(x) is a field; in particular, it is no longer possible
that $p_1(x)p_2(x) = 0 \pmod{p(x)}$ unless either $p_1(x) = 0$ or $p_2(x) = 0$. The field
property allows us to prove that the hash function is pairwise independent (see
Lemma 3), but it is not 3-wise independent because of Proposition 2. There is also
a direct argument:

$$x(h(x_1, x_1, x_2) - h(x_1, x_1, x_1)) + h(x_1, x_1, x_1) = h(x_1, x_2, x_1).$$

In the sense of Proposition 2, it is an optimal recursive hashing function.
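
The shift-and-reduce step for General can be sketched as follows, using the degree-19 polynomial quoted above. This is an illustrative sketch with hypothetical names: it shifts the whole 19-bit value and then removes any $x^{19}$ term by an exclusive or with the full encoding of p(x), which yields the same reduced polynomial as adding $p(x) - x^{19}$. The irreducibility of p(x) is taken from the text and is not checked here.

```python
import random

DEGREE = 19
# p(x) from the text, encoded with bit e set for each term x^e.
POLY = sum(1 << e for e in (0, 2, 3, 5, 6, 7, 12, 16, 17, 18, 19))


def mul_x(value):
    """Multiply a polynomial of degree < 19 by x, then reduce modulo p(x)."""
    value <<= 1
    if value >> DEGREE:   # an x^19 term appeared
        value ^= POLY     # subtracting p(x) clears it (XOR in GF(2))
    return value


class GeneralHash:
    def __init__(self, n, seed=None):
        if n >= DEGREE:
            raise ValueError("this sketch requires n < 19")
        self.n = n
        self._rng = random.Random(seed)
        self._h1 = {}  # symbol -> random 19-bit value, filled lazily

    def h1(self, symbol):
        if symbol not in self._h1:
            self._h1[symbol] = self._rng.getrandbits(DEGREE)
        return self._h1[symbol]

    def direct(self, ngram):
        value = 0
        for symbol in ngram:  # Horner evaluation modulo p(x)
            value = mul_x(value) ^ self.h1(symbol)
        return value

    def update(self, value, outgoing, incoming):
        out = self.h1(outgoing)
        for _ in range(self.n):  # h_1(outgoing) * x^n mod p(x)
            out = mul_x(out)
        return mul_x(value) ^ out ^ self.h1(incoming)


text, n = "abracadabra", 4
hasher = GeneralHash(n, seed=42)
value = hasher.direct(text[:n])
for i in range(1, len(text) - n + 1):
    value = hasher.update(value, text[i - 1], text[i + n - 1])
    assert value == hasher.direct(text[i:i + n])  # rolling matches direct
```

In a tuned implementation, the product $h_1(a)x^n$ would presumably be stored in a second look-up table rather than recomputed with n shifts at every update.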

Lemma 3 General is pairwise independent.

Proof: If p(x) is irreducible, then any non-zero $q(x) \in$ GF(2)[x]/p(x) has an inverse, noted
$q^{-1}(x)$, since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in
GF(2)[x]/p(x).

Firstly, we prove that General is uniform. In fact, we show a stronger result:
$P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y) = 1/2^L$ for any set of
polynomials $q_i$ with at least one of them different from zero. The result is shown by
induction on the number of non-zero polynomials: it is clearly true where there is
a single non-zero polynomial. Suppose it is true for up to k - 1 non-zero
polynomials and consider a case where we have k non-zero polynomials. Assume
without loss of generality that $q_1(x) \ne 0$, we have

$$P(q_1(x)h_1(a_1) + q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y)$$
$$= P(h_1(a_1) = q_1^{-1}(x)(y - q_2(x)h_1(a_2) - \cdots - q_n(x)h_1(a_n)))$$
$$= \sum_{y'} P(h_1(a_1) = q_1^{-1}(x)(y - y'))\, P(q_2(x)h_1(a_2) + \cdots + q_n(x)h_1(a_n) = y') = \sum_{y'} \frac{1}{4^L} = \frac{1}{2^L}$$

by the induction argument. Hence the uniformity result is shown.

Consider two distinct sequences $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$. We have that
$P(h(a_1, a_2, \ldots, a_n) = y \land h(a'_1, a'_2, \ldots, a'_n) = y') = P(h(a_1, a_2, \ldots, a_n) = y \mid h(a'_1, a'_2, \ldots, a'_n) =
y')\, P(h(a'_1, a'_2, \ldots, a'_n) = y')$. Hence, to prove pairwise independence, it suffices to
show that $P(h(a_1, a_2, \ldots, a_n) = y \mid h(a'_1, a'_2, \ldots, a'_n) = y') = 1/2^L$.

Suppose that $a_i = a'_j$ for some i, j; if not, the result follows since by the (full)
independence of the hashing function $h_1$, the values $h(a_1, a_2, \ldots, a_n)$ and $h(a'_1, a'_2, \ldots, a'_n)$
are independent. Write $q(x) = -(\sum_{k \mid a_k = a_i} x^{n-k})(\sum_{k \mid a'_k = a'_j} x^{n-k})^{-1}$, then
$h(a_1, a_2, \ldots, a_n) + q(x)h(a'_1, a'_2, \ldots, a'_n)$ is independent from $a_i = a'_j$ (and $h_1(a_i) = h_1(a'_j)$).

In $h(a_1, a_2, \ldots, a_n) + q(x)h(a'_1, a'_2, \ldots, a'_n)$, only hashed values different from
$h_1(a_i)$ remain: label them $h_1(b_1), \ldots, h_1(b_m)$. The result of the substitution can be
written $h(a_1, a_2, \ldots, a_n) + q(x)h(a'_1, a'_2, \ldots, a'_n) = \sum_k q_k(x)h_1(b_k)$ where $q_k(x)$ are
polynomials in GF(2)[x]/p(x). All $q_k(x)$ are zero if and only if $h(a_1, a_2, \ldots, a_n) +
q(x)h(a'_1, a'_2, \ldots, a'_n) = 0$ for all values of $h_1(a_1), \ldots, h_1(a_n)$ and $h_1(a'_1), \ldots, h_1(a'_n)$
(but notice that the value $h_1(a_i) = h_1(a'_j)$ is irrelevant); in particular, it must be true
when $h_1(a_i) = 1$ and $h_1(a'_i) = 1$ for all i, hence $(1 + x + \ldots + x^{n-1}) + q(x)(1 + x +
\ldots + x^{n-1}) = 0 \Rightarrow q(x) = -1$. Thus, all $q_k(x)$ are zero if and only if $h(a_1, a_2, \ldots, a_n) =
h(a'_1, a'_2, \ldots, a'_n)$ for all values of $h_1(a_1), \ldots, h_1(a_n)$ and $h_1(a'_1), \ldots, h_1(a'_n)$, which
only happens if the sequences a and a' are identical. Hence, not all $q_k(x)$ are zero.

The condition $h(a'_1, a'_2, \ldots, a'_n) = y'$ can be rewritten as
$h_1(a'_j) = (\sum_{k \mid a'_k = a'_j} x^{n-k})^{-1}(y' - \sum_{k \mid a'_k \ne a'_j} x^{n-k} h_1(a'_k))$,
and this last condition is clearly independent from $h(a_1, a_2, \ldots, a_n) +
q(x)h(a'_1, a'_2, \ldots, a'_n) = y + q(x)y'$ where $h_1(a'_j) = h_1(a_i)$ does not appear. We have

$$P(h(a_1, a_2, \ldots, a_n) = y \mid h(a'_1, a'_2, \ldots, a'_n) = y')$$
$$= P(h(a_1, a_2, \ldots, a_n) + q(x)h(a'_1, a'_2, \ldots, a'_n) = y + q(x)y')$$
$$= P\Big(\sum_k q_k(x)h_1(b_k) = y + q(x)y'\Big)$$

and by the earlier uniformity result, this last probability is equal to $1/2^L$. This
concludes the proof. □

Of the four recursive hashing functions investigated by Cohen, these two were
superior both in terms of speed and uniformity, though CYCLIC had a small edge
over General. For n large, the benefits of these recursive hash functions
compared to the n-wise independent hash function presented earlier can be substantial:
n table look-ups⁴ are much more expensive than a single look-up followed by binary
shifts.

A variation of the Karp-Rabin hash method is "Hashing by Power-of-2 Integer
Division" [Coh97], where $h(x_1,\ldots,x_n) = \sum_i x_i B^{i-1} \bmod 2^L$. Parameter $B$
needs to be chosen carefully, so that the sequence $B^k \bmod 2^L$ for $k = 1,2,\ldots$ does
not repeat quickly. In particular, the hashcode method for the Java 1.5 String class
uses this approach, with $L = 32$ and $B = 31$ [Sun04]. Note that $B$ is much smaller than
the range of values that the 16-bit Unicode characters can assume. A widely used
textbook [Wei99, p. 157] recommends a similar Integer Division hash function for
strings with $B = 37$ (e.g., $\sum_i x_i 37^{i-1} \bmod R$) and hints that $R$ should be prime.
Since such Integer Division hash functions are recursive, quickly computed, and widely
used, it is interesting to seek a randomized version of them. Assume that $h_1$ is a
random hash function over symbols, uniform in $[0,2^L)$, then define
$h(x_1,\ldots,x_n) = h_1(x_1) + B h_1(x_2) + B^2 h_1(x_3) + \ldots + B^{n-1} h_1(x_n) \pmod{2^L}$
for some fixed integer $B$. We choose $B = 37$ (calling the resulting randomized hash
"ID37"). Observe that it has a long cycle when $R = 2^L$. We do not require
$h_1(x_i) \in [0,B)$.
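
To illustrate what "recursive" means for such a hash, the following Python sketch computes ID37 over all $n$-grams of a sequence by updating the previous hash value instead of rehashing each $n$-gram. It is only a sketch under stated assumptions: the names `make_symbol_hash`, `id37_direct` and `id37_rolling` are hypothetical, $h_1$ is simulated by a lazily filled table of pseudorandom $L$-bit values, and the update multiplies by the inverse of $B$ modulo $2^L$, which exists because $B = 37$ is odd. This update rule is an illustrative choice, not necessarily the one used in the experiments.

```python
import random

L = 19                       # number of hash bits (illustrative value)
MOD = 1 << L
B = 37
B_INV = pow(B, -1, MOD)      # modular inverse; needs Python 3.8+ and B odd


def make_symbol_hash(seed=0):
    """Return h1: a lazily filled table of random L-bit values per symbol."""
    rng = random.Random(seed)
    table = {}

    def h1(symbol):
        if symbol not in table:
            table[symbol] = rng.getrandbits(L)
        return table[symbol]

    return h1


def id37_direct(gram, h1):
    """h(x_1..x_n) = sum_i B^(i-1) h1(x_i) mod 2^L, computed from scratch."""
    return sum(pow(B, i, MOD) * h1(s) for i, s in enumerate(gram)) % MOD


def id37_rolling(seq, n, h1):
    """Yield the hash of every n-gram of seq using O(1) work per symbol."""
    top = pow(B, n - 1, MOD)             # weight of the newest symbol
    h = id37_direct(seq[:n], h1)
    yield h
    for i in range(n, len(seq)):
        # Drop the oldest symbol (weight B^0), divide by B, add the new one.
        h = ((h - h1(seq[i - n])) * B_INV + top * h1(seq[i])) % MOD
        yield h
```

Under these assumptions, the rolling values agree with the direct definition for every $n$-gram, while each update costs a constant number of arithmetic operations and two table look-ups.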

4 Recall we assume that $\Sigma$ is not known in advance. Otherwise, for many applications,
each table lookup could be merely an array access.



Observe that ID37 is recursive over $h_1$. Moreover, by letting $h_1$ map symbols
over a wide range, we intuitively can reduce the undesirable dependence between
$n$-grams sharing a common suffix. However, in doing so we destroy a more
fundamental property: uniformity.

The problem with ID37 is shared by all such randomized Integer-Division hash
functions that map $n$-grams to $[0,2^L)$. However, it is more severe for certain
combinations of $B$ and $n$:

Proposition 3 Randomized Integer-Division ($2^L$) hashing with $B$ odd is not
uniform for $n$-grams, if $n$ is even. Otherwise, it is uniform, but not pairwise
independent.
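
For small $L$, the uniformity claims of Proposition 3 can be checked by brute force. The Python sketch below enumerates every possible assignment of $h_1$ values to the symbols of a given $n$-gram and tabulates the resulting hash values. The function names and the tiny value $L = 4$ used in the usage note are assumptions made only for illustration.

```python
from collections import Counter
from itertools import product


def distribution(gram, B, L):
    """Exact distribution of h(gram) = sum_i B^i h1(gram[i]) mod 2^L,
    taken over every assignment of L-bit values to the distinct symbols."""
    mod = 1 << L
    symbols = sorted(set(gram))
    counts = Counter()
    for values in product(range(mod), repeat=len(symbols)):
        h1 = dict(zip(symbols, values))
        h = sum(pow(B, i, mod) * h1[s] for i, s in enumerate(gram)) % mod
        counts[h] += 1
    return counts


def is_uniform(gram, B, L):
    """True when every one of the 2^L hash values is equally likely."""
    counts = distribution(gram, B, L)
    return len(counts) == (1 << L) and len(set(counts.values())) == 1
```

With $L = 4$, `is_uniform("aa", 37, 4)` is false because $(1 + B)h_1(a)$ is always even, whereas `is_uniform("aa", 36, 4)` and `is_uniform("aaa", 37, 4)` are true, in line with the proposition.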

Proof: We see that $P(h(a^{2k}) = 0) > 2^{-L}$ since
$h(a^{2k}) = h_1(a)(B^0(1+B) + B^2(1+B) + \ldots + B^{2k-2}(1+B)) \bmod 2^L$
and since $(1+B)$ is even, we have
$P(h(a^{2k}) = 0) \ge P(h_1(x_1) = 2^{L-1} \vee h_1(x_1) = 0) = 1/2^{L-1}$.

For the rest of the result, we begin with $n = 2$ and $B$ even.

\begin{align*}
P(h(ab) = y) &= P(B h_1(a) + h_1(b) = y \bmod 2^L) \\
&= \sum_z P(h_1(b) = y - Bz \bmod 2^L) P(h_1(a) = z) \\
&= \sum_z P(h_1(b) = y - Bz \bmod 2^L)/2^L = 1/2^L,
\end{align*}

whereas $P(h(aa) = y) = P((B+1)h_1(a) = y \bmod 2^L) = 1/2^L$ since
$(B+1)x = y \bmod 2^L$ has a unique solution $x$ when $B$ is even. Therefore $h$ is
uniform. This argument can be extended for any value of $n$ and for $n$ odd, $B$ even.

To show it is not pairwise independent, first suppose that $B$ is odd. For any
string $\beta$ of length $n - 2$, consider $n$-grams $w_1 = \beta aa$ and $w_2 = \beta bb$. Then

\begin{align*}
P(h(w_1) = h(w_2)) &= P(B^2 h(\beta) + B h_1(a) + h_1(a) = B^2 h(\beta) + B h_1(b) + h_1(b) \bmod 2^L) \\
&= P((1+B)(h_1(a) - h_1(b)) \bmod 2^L = 0) \\
&\ge P(h_1(a) - h_1(b) = 0) + P(h_1(a) - h_1(b) = 2^{L-1}) \\
&= 2/2^L,
\end{align*}

which exceeds the value $1/2^L$ that pairwise independence would require. Second, if
$B$ is even, a similar argument shows $P(h(w_3) = h(w_4)) \ge 2/2^L$, where
$w_3 = \beta aa$ and $w_4 = \beta ba$. □
These results also hold for any integer-division hash where the modulo is by an
even number, not necessarily a power of 2. Frequently, such hashes compute their
result modulo a prime. However, even if this gave uniformity, the GT algorithm
implicitly applies a "MOD $2^L$" operation because it ignores higher-order bits. It
is easy to observe that if $h(x)$ is uniform over $[0,p)$, with $p$ prime, then
$h'(x) = h(x) \bmod 2^L$ cannot be uniform.



Whether the lack of uniformity and pairwise independence is just a theoretical 
defect can be addressed experimentally. 

4 Count Estimation by Probing 

Count estimates, using the algorithms of Gibbons and Tirthapura or of Bar-Yossef
et al., depend heavily on the hash function used and the buffer memory allocated to
the algorithms [GT01, BYJK+02]. This section shows that better accuracy bounds
from a single run of the algorithm follow if the hash function is drawn from a family
of $k$-wise independent hash functions ($k > 2$), than if it is drawn from a family of
merely pairwise independent functions. In turn, this implies that less buffer space
can achieve a desired quality of estimates.

Another method for improving the estimates of these techniques is to run them
multiple times (with a different hash function chosen from its family each time).
Then, take the median of the various estimates. For estimating some quantity $\mu$
within a relative error ("precision") of $\epsilon$, it is enough to have a sequence of random
variables $X_i$ for $i = 1,\ldots,q$ such that $P(|X_i - \mu| > \epsilon\mu) < 1/3$ where $\mu = \bar{X}_i$. The
median of all $X_i$ will lie outside $(\mu - \epsilon\mu, \mu + \epsilon\mu)$ only if more than half the $X_i$ do.
This, in turn, can be made very unlikely simply by considering many different
random variables ($q$ large). Let $Y_i$ be the random variable taking the value 1 when
$|X_i - \mu| > \epsilon\mu$ and zero otherwise, and furthermore let $Y = \sum_{i=1}^q Y_i$. We have that
$E(Y) \le q/3$ and so, $3E(Y)/2 \le q/2$. Then a Chernoff bound says that [Can06]

$$P(Y > q/2) \le \cdots$$

Choosing $q = 30 \ln 1/\delta$, we have $P(Y > q/2) \le \delta$, proving that we can make the
median of the $X_i$'s within $\epsilon\mu$ of $\mu$, $1 - \delta$ of the time for $\delta$ arbitrarily small. On this
basis, Bar-Yossef et al. [BYJK+02] report that they can estimate a count with relative
precision $\epsilon$ and reliability $1 - \delta$ (footnote 5), using
$\tilde{O}((\frac{1}{\epsilon^2} + \log m)\log\frac{1}{\delta})$ bits of memory and
$\tilde{O}((\log m + \log\frac{1}{\epsilon})\log\frac{1}{\delta})$ amortized time. Unfortunately, in practice, repeated
probing is not a competitive solution since it implies rehashing all $n$-grams
$30 \ln 1/\delta$ times, a critical bottleneck. Moreover, in a streaming context, the various
runs are made in parallel and therefore $30 \ln 1/\delta$ different buffers are needed. Whether
this is problematic depends on the application and the size of each buffer. For $n$-gram
estimation, in one pass we may be attempting to compute estimates for various
values of $n$, each of which would then use a set of buffers.

5 i.e., the computed count is within $\epsilon\mu$ of the true answer $\mu$, $1 - \delta$ of the time

Figure 1: For reliability $1 - \delta = 0.95$, we plot the memory usage $M$ versus the
accuracy $\epsilon$ for pairwise ($p = 2$) and 4-wise ($p = 4$) independent hash functions
as per the bound of Proposition 4. Added independence substantially improves
memory requirements for a fixed estimation accuracy according to our theoretical
bounds.
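
The cost of repeated probing can be made concrete with a small calculation. The Python sketch below computes the number of runs $q = \lceil 30 \ln 1/\delta \rceil$ and, as a sanity check, the exact probability that more than half of $q$ runs miss. It assumes, for illustration only, that the runs are independent and that each misses with probability exactly 1/3; the function names are hypothetical.

```python
import math
from fractions import Fraction


def runs_needed(delta):
    """Number of repetitions q = ceil(30 ln(1/delta)) quoted in the text."""
    return math.ceil(30 * math.log(1 / delta))


def median_failure_probability(q, p_bad=Fraction(1, 3)):
    """Exact P(Y > q/2) when Y ~ Binomial(q, p_bad), i.e. assuming the q
    runs are independent and each one misses with probability p_bad."""
    total = Fraction(0)
    for k in range(q // 2 + 1, q + 1):
        total += math.comb(q, k) * p_bad ** k * (1 - p_bad) ** (q - k)
    return float(total)
```

For $\delta = 0.05$, `runs_needed(0.05)` gives 90 runs, so every $n$-gram would be hashed 90 times and 90 buffers would be kept. Under the stated assumptions the exact failure probability is far below $\delta$, which suggests the Chernoff-based count is conservative.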

The next proposition shows that in order to reduce the memory usage drastically,
we can increase the independence of the hash functions. In particular, we
can estimate the count within 10%, 19 times out of 20 by storing respectively
10,500, 2,500, and 2,000 $n$-grams (footnote 6) depending on whether we have
pairwise-independent, 4-wise independent or 8-wise independent hash values. Hence, there
is no need to hash the $n$-grams more than once if we can assume that hash values
are approximately 4-wise independent in practice (see Fig. 1).

Theorem 4.1 Let $X_1,\ldots,X_m$ be a sequence of $p$-wise independent random variables
that satisfy $|X_i - E(X_i)| \le 1$. Let $X = \sum_i X_i$, then for $C = \max(p, \sigma^2(X))$, we
have

$$P(|X - \bar{X}| \ge T) \le \left(\frac{pC}{e^{2/3}T^2}\right)^{p/2}.$$

In particular, when $p = 2$, we have

$$P(|X - \bar{X}| \ge T) \le \frac{2C}{e^{2/3}T^2}.$$

Proof: See Schmidt, Siegel, and Srinivasan, Theorem 2.4, Equation III [SSS93]. □

6 in some dictionary, for instance in a hash table

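The following proposition is stated for $n$-grams, but applies generally to arbitrary
items.

Proposition 4 Hashing each $n$-gram only once, we can estimate the number of
distinct $n$-grams within relative precision $\epsilon$, with a $p$-wise independent hash for
$p \ge 2$ by storing $M$ distinct $n$-grams ($M \ge 8p$) and with reliability $1 - \delta$ where $\delta$ is
given by

$$\delta = \frac{p^{p/2}}{e^{p/3}M^{p/2}}\left(\frac{\alpha^{p/2}}{(1-\alpha)^p} + \frac{4^{p/2}}{\alpha^{p/2}\epsilon^p(2^{p/2}-1)}\right)$$

for $4p/M \le \alpha < 1$ and any $p, M$.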

Proof: We generalize a proof by Bar-Yossef et al. [BYJK+02]. Let $X_t$ be the number
of distinct elements having only zeros in the first $t$ bits of their hash value. Then
$X_0$ is the number of distinct elements ($X_0 = m$). For $j = 1,\ldots,m$, let $X_{t,j}$ be the
binary random variable with value 1 if the $j$-th distinct element has only zeros in the
first $t$ bits of its hash value, and zero otherwise. We have that $X_t = \sum_{j=1}^m X_{t,j}$ and so
$E(X_t) = \sum_{j=1}^m E(X_{t,j})$. Since the hash function is uniform, then $P(X_{t,j} = 1) = 1/2^t$
and so $E(X_{t,j}) = 1/2^t$, and hence, $E(X_t) = m/2^t \Rightarrow 2^t E(X_t) = m$. Therefore $X_t$ can
be used to determine $m$.

Using pairwise independence, we can show that $\sigma^2(X_t) \le m/2^t$ because the
variance of a sum of pairwise independent variables is the sum of their variances:

$$\sigma^2(X_t) = \sum_{j=1}^m \sigma^2(X_{t,j}) = \frac{m}{2^t}\left(1 - \frac{1}{2^t}\right) \le \frac{m}{2^t}.$$

More generally, by Theorem 4.1 we have

$$P(|X_t - m/2^t| \ge \epsilon m/2^t) \le \frac{p^{p/2}2^{tp/2}}{m^{p/2}\epsilon^p e^{p/3}}$$

as long as $\sigma^2(X_t) \ge p$.

Let $M$ be the available memory and suppose that the hash function has $L$ bits
such that we know, a priori, that $P(X_L > M) \approx 0$. This is necessary since we do not
wish to increase $L$ dynamically. It is reasonable since for $L$ and $m$ fixed, $P(X_L > M)$
goes down as $1/M^p$:

\begin{align*}
P(X_L > M) &= P(X_L - m/2^L > M - m/2^L) \\
&\le P(|X_L - m/2^L| \ge M - m/2^L) \\
&\le \left(\frac{p\max(m/2^L, p)}{e^{2/3}(M - m/2^L)^2}\right)^{p/2}
\end{align*}

for $M - m/2^L > 0$, where we used Theorem 4.1. For example, if $p = 2$, $M = 256$,
$L = 19$, $P(X_L > M) \le 4.4 \times 10^{-8}$ for $m = 100,000,000$.

The algorithm returns the value $2^{t'}X_{t'}$ where $t'$ is such that $X_{t'} \le M$ and
$X_{t'-1} > M$. (For a given hash function, note that $X_i$ is monotone in $i$ and thus $t'$ is
uniquely determined by the input and the hash function.) The upshot is that $t'$ is itself a
random quantity that depends deterministically on the hash function and the input
(the same factors that determine $X_t$).

We can bound the error of this estimate as follows.

\begin{align*}
P(|2^{t'}X_{t'} - m| \ge \epsilon m) &= \sum_{t=0,\ldots,L} P\left(\left|X_t - \frac{m}{2^t}\right| \ge \epsilon\frac{m}{2^t}\right) P(t' = t) \\
&= \sum_{t=0,\ldots,L} P\left(\left|X_t - \frac{m}{2^t}\right| \ge \epsilon\frac{m}{2^t}\right) P(X_{t-1} > M, X_t \le M).
\end{align*}



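Splitting the summation in two parts, we get

\begin{align*}
P(|2^{t'}X_{t'} - m| \ge \epsilon m)
&\le \sum_{t=0}^{T-1} P\left(\left|X_t - \frac{m}{2^t}\right| \ge \epsilon\frac{m}{2^t}\right) + \sum_{t=T,\ldots,L} P(X_{t-1} > M, X_t \le M) \\
&= P(X_{T-1} > M) + \sum_{t=0}^{T-1} P\left(\left|X_t - \frac{m}{2^t}\right| \ge \epsilon\frac{m}{2^t}\right) \\
&\le P(X_{T-1} - m/2^{T-1} > M - m/2^{T-1}) + \sum_{t=0}^{T-1} \frac{p^{p/2}2^{tp/2}}{m^{p/2}\epsilon^p e^{p/3}} \\
&\le P(|X_{T-1} - m/2^{T-1}| \ge M - m/2^{T-1}) + \sum_{t=0}^{T-1} \frac{p^{p/2}2^{tp/2}}{m^{p/2}\epsilon^p e^{p/3}} \\
&\le \left(\frac{pm}{2^{T-1}e^{2/3}(M - m/2^{T-1})^2}\right)^{p/2} + \sum_{t=0}^{T-1} \frac{p^{p/2}2^{tp/2}}{m^{p/2}\epsilon^p e^{p/3}} \\
&\le \frac{p^{p/2}m^{p/2}}{2^{(T-1)p/2}e^{p/3}(M - m/2^{T-1})^p} + \frac{p^{p/2}(2^{Tp/2} - 1)}{m^{p/2}\epsilon^p e^{p/3}(2^{p/2} - 1)}
\end{align*}

where we assumed that $p \le m/2^{T-1}$.

Choose $T \in \{0,\ldots,L\}$ such that $\alpha M/4 \le m/2^T < \alpha M/2$ for some $\alpha < 1$ satisfying
$\alpha M/4 \ge p$, then

$$\frac{p^{p/2}m^{p/2}}{2^{(T-1)p/2}e^{p/3}(M - m/2^{T-1})^p} \le \frac{p^{p/2}M^{p/2}\alpha^{p/2}}{e^{p/3}(1-\alpha)^p M^p}$$

whereas

$$\frac{p^{p/2}(2^{Tp/2} - 1)}{m^{p/2}\epsilon^p e^{p/3}(2^{p/2} - 1)} \le \frac{p^{p/2}4^{p/2}}{\epsilon^p\alpha^{p/2}M^{p/2}e^{p/3}(2^{p/2} - 1)}.$$

Hence, we have

$$P(|2^{t'}X_{t'} - m| \ge \epsilon m) \le \frac{p^{p/2}\alpha^{p/2}}{e^{p/3}(1-\alpha)^p M^{p/2}} + \frac{p^{p/2}4^{p/2}}{\epsilon^p\alpha^{p/2}M^{p/2}e^{p/3}(2^{p/2} - 1)}.$$

Setting $\alpha = 1/2$, we have

$$P(|2^{t'}X_{t'} - m| \ge \epsilon m) \le \frac{p^{p/2}2^{p/2}}{e^{p/3}M^{p/2}} + \frac{p^{p/2}8^{p/2}}{\epsilon^p M^{p/2}e^{p/3}(2^{p/2} - 1)}.$$

For different values of $p$ and $\epsilon$, other values of $\alpha$ can give tighter bounds: the best
value of $\alpha$ can be estimated numerically. The proof is completed. □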
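
The estimator analysed in this proof (return $2^{t'}X_{t'}$, where $t'$ is the smallest level whose buffer fits in memory) can be sketched in a few lines of Python. This is a minimal sketch and not the C++ implementation used in the experiments: the function names are hypothetical, items are assumed to be strings, the $t$ low-order bits of the hash play the role of the "first $t$ bits", and a salted BLAKE2 digest merely stands in for a hash drawn from a $p$-wise independent family.

```python
import hashlib


def hash_bits(item, seed, L=32):
    """Stand-in hash: L low-order bits of a salted BLAKE2b digest.
    (Assumption: replaces a p-wise independent hash for illustration.)"""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") & ((1 << L) - 1)


def estimate_distinct(items, M, seed=0, L=32):
    """One-pass estimate of the number of distinct items using at most M
    buffered items: keep items whose hash has t zero low-order bits."""
    t = 0
    buffer = set()
    for item in items:
        if hash_bits(item, seed, L) & ((1 << t) - 1) == 0:
            buffer.add(item)
            # When the buffer overflows, move to the next level and purge.
            while len(buffer) > M:
                t += 1
                mask = (1 << t) - 1
                buffer = {x for x in buffer
                          if hash_bits(x, seed, L) & mask == 0}
    return (1 << t) * len(buffer)
```

When the number of distinct items does not exceed $M$, the level stays at $t = 0$ and the count is exact; otherwise the buffer holds roughly between $M/2$ and $M$ items and the count is scaled by $2^t$.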



It may seem that the result of Proposition 4 is independent of the size of the data
set and of the number of distinct symbols ($m$). Indeed, it provides a fixed accuracy
$\epsilon$ for a given memory budget ($M$) irrespective of the data source. However, as the
proof shows, the number of bits in the hash values needs to grow with $\log m$. This
requires additional computational costs.

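Again the following corollary applies not only to $n$-grams but also to arbitrary
items. It follows from setting $M = 576/\epsilon^2$, $p = 2$, and $\alpha = 1 - \epsilon$.

Corollary 1 With $M = 576/\epsilon^2$ and $p = 2$, we can estimate the count of distinct
$n$-grams with reliability ($1 - \delta$) of 99% for any $\epsilon > 0$.

Proof: From Proposition 4, consider

$$\delta = \frac{2}{e^{2/3}M}\left(\frac{\alpha}{(1-\alpha)^2} + \frac{4}{\alpha\epsilon^2}\right) = \frac{2}{576\,e^{2/3}}\left((1-\epsilon) + \frac{4}{1-\epsilon}\right)$$

for $\alpha = 1 - \epsilon$, $M = 576/\epsilon^2$. Taking the limit as $\epsilon$ tends to 0, on the basis that the
probability of a miss ($P(|X_t - m/2^t| \ge \epsilon m/2^t)$) can only grow as $\epsilon$ diminishes, we have
that $\delta$ is bounded by $10/(576\,e^{2/3}) \approx 0.009$. □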

This significantly improves the bound of 1/6 for $\delta$ given in Bar-Yossef et al. [BYJK+02]
for the same value of $M$.

Because $p$-wise independence implies $(p-1)$-wise independence, memory usage,
accuracy and reliability can only improve as $p$ increases. For $p$ large ($p \gg 4$),
the result of Proposition 4 is no longer useful because of the $p^p$ factor. Other
inequalities on $P(|X - \mu|)$ for $p$-wise independent $X$'s should be used.



4.1 Entropy and Iceberg Estimation 

With only minor modifications, the techniques used for counting can be modified
to provide estimates of various other statistics. We consider entropy [GMV06,
CBM06, BDKR02] and iceberg counts [FSGM+98]: the GT algorithm is modified
to track not only the existence of items hashing to zero, but also the required
properties (occurrence counts, for instance) of these items. From this, the statistic can
be estimated for the entire data stream.



Let $I$ be the entire set of $n$-grams ($N = \mathrm{card}(I)$) and $I'$ be the set of probed
$n$-grams with $m' = \mathrm{card}(I')$ ($m' \in [M/2, M)$). For $i \in I'$, we know the exact number
of occurrences $f_i$ of $i$ and so the probability that any given $n$-gram in $I$ is $i$, $P(i)$, is
given by $P(i) = f_i/N$. Hence, we can compute $\sum_{i \in I'} P(i)\log P(i)$ exactly. A practical
one-pass estimate of the Shannon entropy of the $n$-grams ($-\sum_{i \in I} P(i)\log P(i)$) is
$-\frac{m}{m'}\sum_{i \in I'} P(i)\log P(i)$ since we have a tight estimate for $m$. Unfortunately, we do
not have any theoretical results on this estimator. If we were to allow two passes
over the data, an unassuming algorithm can estimate the entropy with theoretical
bounds [GMV06]. It first samples and then probes.
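
The one-pass entropy estimate described above amounts to rescaling the partial sum over the probed $n$-grams by $m/m'$. A minimal Python sketch follows; the function names are hypothetical, the probed occurrence counts are assumed to be available as a dictionary, and the usual minus sign of the Shannon entropy is made explicit.

```python
import math


def exact_entropy(counts, N):
    """Shannon entropy -sum P(i) log P(i) with P(i) = f_i / N."""
    return -sum((f / N) * math.log(f / N) for f in counts.values())


def estimated_entropy(probed_counts, N, m_estimate):
    """Rescale the partial sum over the m' probed n-grams by m / m',
    where m_estimate is the (estimated) number of distinct n-grams."""
    m_probed = len(probed_counts)
    partial = sum((f / N) * math.log(f / N) for f in probed_counts.values())
    return -(m_estimate / m_probed) * partial
```

When every distinct $n$-gram is probed ($m' = m$) the estimate coincides with the exact entropy; with fewer probed items it implicitly assumes that the probed items are representative of the others, which is where biased distributions can hurt.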

Various "iceberg count" properties can be handled similarly. In these problems,
one is given a predicate on the number of occurrences $f_i$ of an item $i$. We seek to
estimate the number of distinct items satisfying the predicate. For example, if the
predicate is "$f_i > c$" for some fixed $c$, then we wish to estimate
$\mathrm{card}(\{i \in I \mid f_i > c\})$. For instance, the input text aabaabb contains the 2-grams aa
(with count 2), ab (with count 2), ba (with count 1) and bb (with count 1). If the predicate
$Q(i)$ is "item $i$ occurs exactly twice", then the answer to the iceberg query is 2
(because distinct items aa and ab satisfy $Q(i)$). Considering the predicate "$f_i > 0$",
the iceberg-count problem generalizes the problem of counting distinct $n$-grams.
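
The worked example above can be reproduced with an exact (non-streaming) computation, which is useful as a reference when testing an estimator. The helper names in this Python sketch are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Exact occurrence count f_i of every n-gram i in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def iceberg_count(counts, predicate):
    """Number of distinct items whose occurrence count satisfies predicate."""
    return sum(1 for f in counts.values() if predicate(f))
```

For the text aabaabb and $n = 2$, `iceberg_count(ngram_counts("aabaabb", 2), lambda f: f == 2)` returns 2, and the predicate `lambda f: f > 0` returns the number of distinct 2-grams, 4.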

While we do not have any theoretical bound on the accuracy of such iceberg
estimates, we can still model the problem statistically. If we are picking at random $m'$
elements from a population of size $m$ having $r \le m$ elements satisfying a predicate,
the number of elements satisfying the constraint, $Y$, follows a hypergeometric
distribution with mean $m'r/m$ and with variance $\frac{m'r(m-r)(m-m')}{m^2(m-1)}$. Therefore, by
Chebyshev's inequality, we have that
$P(|Y - m'r/m| \ge \epsilon m'r/m) \le \frac{(m-r)(m-m')}{\epsilon^2 m' r (m-1)} \le$ [...].
Hence, we should allocate roughly $M \approx$ [...] units of storage to have an
accuracy of $\epsilon$, 19 times out of 20. Choosing $m' =$ [...] ensures that the number of
interesting items found ($Y$) is larger than 190, 19 times out of 20. Hence, if an upper
bound on $m$ and a non-trivial lower bound on $r$ were known, further theoretical
results could be possible.

If $M$ is small and the distribution is biased (e.g. Zipfian), such as is the case
with $n$-grams of English text, we expect these entropy and iceberg count estimates
to be poor.

4.2 Simultaneous Estimation 

The GT approach can also be applied to estimate the number of 1-grams, 2-grams,
..., $n$-grams in one pass over the data. Initially, one might seek hash families such
that if $i$ is a $(\beta+1)$-gram and $j$ is its $\beta$-gram suffix, then $h(i)$ has zero in its first $t$
bits whenever the hash value $h(j)$ has zero in its first $t$ bits. For example,
$h(a) = 0 \Rightarrow h(ba) = 0$ (footnote 7). Unfortunately, such a family of hash functions cannot
be pairwise independent because if $h(ba) = 0$ then $h(ca) = 0$ for any value of $c$. However, a
straightforward approach can suffice for simultaneous estimation. Separate buffers
($n$ in total) can be kept for each string length (1-gram, 2-gram, ..., $n$-gram), and
for simplicity we can assume each buffer is of the same size, $M$.

The straightforward approach works as follows: As each of the $N$ symbols in
the stream is processed, we hash the $n$ new $k$-grams of size $k = 1, 2, \ldots, n$ respectively.
If each of the $n$ hashes is recursive, then each can be updated from its previous value
in $O(1)$ time. However, we can also use semi-recursive hashes with this approach,
because the hash value for a $k$-gram can be updated, in $O(1)$ time, to become the
hash value for the associated $(k+1)$-gram.

In particular, the semi-recursive "$n$-wise" hashing scheme examined before
becomes more attractive when simultaneous estimation is attempted.

5 Experimental Results 

Experiments are used to assess the accuracy of estimates obtained by several hash 
functions on some input streams. As well, the experiments demonstrate that the 
techniques can be efficiently implemented. Our code is written in C++ and is 
available upon request. 

5.1 Test Inputs 

One complication is that the theoretical predictions are based on worst-case
analysis. There may not be a sequence of symbols realizing these bounds. As a
result, our experiments used the $n$-grams from a collection of 11 texts (footnote 8)
from Project Gutenberg. We also used synthetic data sets generated according to various
generalized Zipfian distributions. Since we are analyzing the performance of several
randomized algorithms, we ran each algorithm 100+ times on each text. We cannot
run tests on inputs as large as would be appropriate for corpus linguistics studies:
to complete the entire suite of experiments in reasonable time, we must limit
ourselves to texts (for instance, Shakespeare's First Folio) where one run takes at most
a few minutes.

7 We abuse the notation by using $h$ to denote both the hash function for bigrams and the hash
function for unigrams.

8 The 11 texts are eduha10 (The Education of Henry Adams), utrkj10 (Unbeaten Tracks in Japan),
utopi10 (Utopia), remus10 (Uncle Remus His Songs and His Sayings), btwoe10 (Barchester Towers),
00ws110 (Shakespeare's First Folio), hcath10 (History of the Catholic Church), rlchn10 (Religions of
Ancient China), esymn10 (Essay on Man), hioaj10 (Impeachment of Andrew Johnson), and wflsh10
(The Way of All Flesh).



Table 1: Maximum error rates $\epsilon$ 19 times out of 20 for various amounts of memory
($M$) and for $p$-wise independent hash values according to Proposition 4.

M        256      1024     2048     65536    262144   1048576
p = 2    86.4%    36.8%    24.7%    3.8%     1.8%     0.9%
p = 4    34.9%    16.1%    11.1%    1.8%     0.9%     0.5%
p = 8    30.0%    14.1%    9.7%     1.6%     0.8%     0.4%
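
Entries of this kind can be recomputed numerically. The Python sketch below assumes that the bound of Proposition 4 has the form $\delta = \frac{p^{p/2}}{e^{p/3}M^{p/2}}\left(\frac{\alpha^{p/2}}{(1-\alpha)^p} + \frac{4^{p/2}}{\alpha^{p/2}\epsilon^p(2^{p/2}-1)}\right)$, as the two terms bounded in its proof suggest. It solves for $\epsilon$ at fixed $\delta$ and searches a grid of $\alpha$ values for the tightest result. The function names and the grid resolution are illustrative assumptions.

```python
import math


def delta_bound(M, p, eps, alpha):
    """Assumed form of the Proposition 4 bound on the failure rate."""
    common = p ** (p / 2) / (math.exp(p / 3) * M ** (p / 2))
    first = alpha ** (p / 2) / (1 - alpha) ** p
    second = 4 ** (p / 2) / (alpha ** (p / 2) * eps ** p * (2 ** (p / 2) - 1))
    return common * (first + second)


def best_epsilon(M, p, delta=0.05, steps=1000):
    """Smallest eps with delta_bound <= delta over a grid of alpha values."""
    common = p ** (p / 2) / (math.exp(p / 3) * M ** (p / 2))
    best = math.inf
    for i in range(1, steps):
        alpha = i / steps
        if alpha < 4 * p / M:
            continue                      # outside the range 4p/M <= alpha < 1
        slack = delta - common * alpha ** (p / 2) / (1 - alpha) ** p
        if slack <= 0:
            continue                      # first term alone exceeds delta
        coeff = 4 ** (p / 2) / (alpha ** (p / 2) * (2 ** (p / 2) - 1))
        best = min(best, (common * coeff / slack) ** (1 / p))
    return best
```

Under this assumed form, `best_epsilon(256, 2)` is about 0.864 and `best_epsilon(256, 4)` about 0.349, in agreement with the first column of Table 1.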



5.2 Accuracy of Estimates 

We have theoretical bounds relating to the error $\epsilon$ observed with a given
reliability (typically 19/20), when the hash function is taken from a $p$-wise independent
family. (See Table 1.) But how close to this bound do we come when $n$-grams
are drawn from a "typical" input for a computational-linguistics study? And do
hash functions from highly independent families actually enable more accurate
estimates (footnote 9)?

Figure 2 shows the relative error $\epsilon$ observed from four hash functions (100
estimations with each). Estimates have been ranked by decreasing $\epsilon$, and we see
ID37 had more poor runs than the others. Figure 3 shows a test input (remus10)
that was the worst of the 11 for several hash functions, when $M = 256$. ID37 seems
to be doing reasonably well, but we see 10-wise independent hashing lagging.

To study the effect of varying $M$, we use the 5th-largest error of 100 runs.
This 95th-percentile error can be related to the theoretical bound for $\epsilon$ with 19/20
reliability. Figure 2(b) plots the largest 95th-percentile error observed over 11 test
inputs. It is apparent that there is no significant accuracy difference between the
hash functions. The $n$-wise independent hash alone has a strong guarantee to be
beneath the theoretical bound. However, over the eleven Gutenberg texts, the others
are just as accurate, according to our experiments.
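
The 95th-percentile error used here is simply the 5th-largest relative error among 100 runs. The following Python sketch makes that convention explicit. The function name is hypothetical, and the generalization to run counts other than 100 (taking the $\lceil n/20 \rceil$-th largest error) is an assumption.

```python
def percentile95_error(estimates, true_count):
    """Return the ceil(n/20)-th largest relative error among n estimates,
    i.e. the 5th-largest error when n = 100."""
    errors = sorted((abs(e - true_count) / true_count for e in estimates),
                    reverse=True)
    k = -(-len(errors) // 20)        # ceiling of n/20 without floating point
    return errors[k - 1]
```

With 100 runs, 95 of the observed errors are no larger than the returned value, which is why it can be compared with a bound that holds 19 times out of 20.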

5.3 Using a Wider Range of Values for M 

An important motivation for using $p$-wise independent hashing is to obtain a
reliable estimate while only hashing once, using a small $M$. Nevertheless, we have thus
far not observed notable differences between the different hash functions. Therefore,
although we expect typical values of $M$ to be a few hundred to a few thousand,
we can broaden the range of $M$ examined. Although the theoretical guarantees for
tiny $M$ are poor, perhaps typical results will be usable. And even a single buffer
with $M = 2^{20}$ is inconsequential when a desktop computer has several gibibytes of
RAM, and the construction of a hash table or B-tree with such a value of $M$ is still
quite affordable. Moreover, with a wider range of $M$, we start to see differences
between some hash functions.

9 The "data-agnostic" estimate from Sect. 2 is hopelessly inaccurate: it predicts 4.4 million
5-grams for Shakespeare's First Folio, but the actual number is 13 times smaller.

We choose $M = 16$, $16^2$, $16^3$ and $16^4$ and analyze the 5-grams in the text 00ws1
(Shakespeare's First Folio). There are approximately 300,000 5-grams, and we
selected a larger file because when $M = 16^4$ it seems unhelpful to estimate the
number of 5-grams unless the file contains substantially more 5-grams than $M$.

Figure 4 shows the 95th-percentile errors for Shakespeare's First Folio, when
5-grams are estimated. There are some smaller differences for $M = 65536$
(surprisingly, the 5-wise hash function, with a better theoretical guarantee, seems to
be slightly worse than Cohen's hash functions). However, it is clear that the
theoretical deficiencies in ID37 finally have an effect: it is small when $M = 4096$ but
clear at $M = 65536$. (We observed similar problems on Zipfian data also.) To be
fair, this non-uniform hash is still performing better than the pairwise bound, but
the trend appears clear. Does it, however, continue for very large $M$?

5.4 Very Large M 

Clearly, it is only sensible to measure performance when $M \ll m$. Therefore, we
estimate the number of 10-grams obtained when all plain-text files in the Gutenberg
CD are concatenated. When $M = 2^{20}$, 10-wise independent hashing had an
observed 95th-percentile error of 0.182% and General had 0.218%. The ID37
error was somewhat worse, at 0.286%. (The theoretical pairwise error bound is
0.908% and the 10-wise bound is 0.425%.) Considering the $M = 65536$ case from
Fig. 4, we see no experimental reason to prefer $n$-wise hashing to General, but
ID37 looks less promising. However, $n = 10$, $B = 37$ is a non-uniform combination
for Integer Division.

5.5 Caveats with Random-Number Generators 

To observe the effect of fully independent hashing, we implemented the usual (slow
and memory-intensive) scheme where a random value is assigned and stored whenever
a key is first seen. Clearly, probabilistic counting of $n$-grams is likely to expose
deficiencies in the random-number generator and therefore different techniques
were tried. The pseudorandom-number generator in the GNU/Linux C library was
tried, as were the Mersenne Twister (MT) [MN98] and also the Marsaglia-Zaman-James
(MZJ) generator [MZ87, Jam90, Bou98]. We also tried using a collection
of bytes generated from a random physical process (radio static) [Haa98].



For $M = 4096$, the 95th-percentile error for text 00ws1 was 4.7% for Linux
rand(), 4.3% for MT and 4.1% for MZJ. These three pseudorandom number
generators were no match for truly random numbers, where the 95th-percentile error
was only 2.9%. Comparing this final number to Fig. 4, we see fully independent
hashing is only a modest improvement on Cohen's hash functions (which fare better
than 5%) despite its stronger theoretical guarantee.

The other hash functions also rely on random-number generation (for $h_1$ in
Cohen's hashes and ID37; for $h_1 \ldots h_n$ in the $n$-wise independent hash). It would
be problematic if their performance were heavily affected by the precise
random-number generation process. However, when we examined the 95th-percentile
errors we did not observe any appreciable differences from varying the
pseudorandom-number generation process or using truly random numbers. (The graphs
are shown in Appendix A.) Surprisingly, the pseudorandom generators may have
been marginally better than the truly random numbers.

5.6 95th-Percentile Errors Using Zipfian Data

The various hash functions were also tested on synthetic Zipfian data, where the
probability of the $k$-th symbol is proportional to $k^{-s}$. (We chose $s \in [0.8, 2.0]$.)
Each data set had $N \approx 10^5$, but for larger values of $s$ there were significantly fewer
$n$-grams. Therefore, measurements for $M = 65536$ would not be meaningful and are
omitted.
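
A synthetic sequence of this kind can be generated as in the Python sketch below, where the $k$-th symbol is drawn with probability proportional to $k^{-s}$. The function name, the alphabet size and the seed are assumptions for illustration, since the generator and alphabet used in the experiments are not described here.

```python
import random


def zipfian_sequence(length, alphabet_size, s, seed=0):
    """Draw `length` symbols from {1, ..., alphabet_size}, where symbol k
    has probability proportional to k**(-s)."""
    rng = random.Random(seed)
    symbols = range(1, alphabet_size + 1)
    weights = [k ** -s for k in symbols]
    return rng.choices(symbols, weights=weights, k=length)
```

For example, `zipfian_sequence(10**5, 1000, 2.0)` yields a sequence dominated by the first few symbols, which is why larger values of $s$ produce far fewer distinct $n$-grams.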

Some results, for $n = 5$, are shown in Fig. 5. The ID37 method is noteworthy.
Its performance for larger $M$ is badly affected by increasing $s$. The other hash
functions are nearly indistinguishable in almost all other cases. Results are similar
for 5-grams and 10-grams (which are not shown), except that $s$ can grow slightly
bigger for 10-grams before ID37 fails badly.

5.7 Choosing Between Cohen's Hash Functions 

The experiments reported so far do not reveal a clear distinction, at least at the
95th-percentile error, between Cohen's General and his Cyclic polynomial hashes.
This is somewhat surprising, considering the theoretical differences (nonuniformity
versus pairwise independence) shown in Lemma 2 and Lemma 3.

The ratio m/M may be significant (because this ratio affects how many hash 
bits are used) and our experiments thus cover two different ratios. We first used 
synthetic Zipfian data (s = 2), looking for 5-grams, since that combination revealed 
the weakness of ID37. Since m = 14826 for our Zipfian data set, choosing M = 64 
gives a ratio about 231 and M = 1024 gives a ratio of about 14. 



Results, for more than 2000 runs, are shown in Table 2. The top of the table
shows $\epsilon$ values (percents), with boldfacing indicating a case when one technique
had a lower error than the other. It shows a slightly better performance for CYCLIC.
Means also slightly favour CYCLIC. This is consistent with the experimental
results reported by Cohen [Coh97].

Table 2: Comparing polynomial hashes Cyclic and General, Zipfian data set.

              Cyclic              General
percentile    M=64     M=1024     M=64     M=1024
25            3.60     0.931      3.60     0.931
50            7.06     1.98       8.49     2.01
75            14.0     3.41       13.7     3.60
95            30.9     6.11       31.2     6.22
mean          10.6     2.45       10.7     2.51



We also ran more extensive tests using 00ws1, where there seems to be no
notable distinction. 10,000 test runs were made. Results are shown in Table 3. The
overall conclusion is that the theoretical advantage held by General does not
carry over. Experimentally, these two techniques cannot be distinguished
meaningfully by our tests.

Table 3: Comparing polynomial hashes Cyclic and General, data set 00ws1.

              Cyclic              General
percentile    M=64     M=1024     M=64     M=1024
25            5.79     1.30       5.79     1.30
50            10.7     2.58       10.7     2.69
75            18.2     4.55       19.0     4.55
95            30.6     7.69       30.6     7.69
mean          12.5     3.15       12.6     3.14



5.8 Estimating Iceberg Counts and Entropy 

Although we do not have a theoretical bound to compare with, experiments tested
the approach to single-pass iceberg-count and entropy estimation given in Section 4.1.
On our 11 data sets, the amounts of relative error (for 5-grams) observed with a
95% reliability are shown in Figs. 6 and 7. (The data is shown separately for the 6 data
sets with 100,000 or more 5-grams, since the $M = 65536$ case is uninteresting for
the other 5 data sets.) The iceberg-count estimates the number of distinct $n$-grams
occurring at least 10 times.

Table 4: Time (seconds) to process all Gutenberg CD files, 10-grams.

Hashing    M = 2^10    M = 2^20
10-wise    794         938
ID37       268         407
Cyclic     345         486
General    352         489

We see that the error of estimates does decrease quickly with M, and that when 
M < 4096, the accuracy was poor. However, the accuracy when M = 4096 might 
be adequate for some applications. 

While we expected poor results on English texts due to the biased distribution, 
the results suggest that useful theoretical bounds are possible. 

5.9 Speed 

Speeds were measured on a Dell PowerEdge 6400 multiprocessor server (with
four Pentium III Xeon 700 MHz processors having 2 MiB cache each, sharing
2 GiB of 133 MHz RAM). The OS kernel was Linux 2.4.20 and the GNU C++
compiler version 3.2.2 was used with relevant compiler flags -O2 -march=i686
-fexceptions. The STL map class was used to construct look-up tables.

Only one processor was used, and the data set consisted of all the plain text
files on the Project Gutenberg CD, concatenated into a single disk file containing
over 400 MiB and approximately 116 million 10-grams. For comparison, this file
was too large to process with the Sary suffix array [Tak05] package (version 1.2.0),
since the array would have exceeded 2 GiB. However, the first 200 MB was
successfully processed by Sary, which took 1886 s to build the suffix array
(footnotes 10 and 11). The SUFARY [Yam05] (version 2.3.8) package is said to be faster
than Sary [Tak05]. It processed the 200 MB file in 2640 s and then required 95 s to
(exactly) compute the number of 5-grams with more than 100,000 occurrences.

10 Various command-line options were attempted and the reported time is the fastest achieved.

11 Pipelined suffix-array implementations reportedly can process inputs as large as 4 GB in
hours [DMKS05].



From Table 4 we see that $n$-gram estimation can be efficiently implemented.
First, comparing results for $M = 2^{20}$ to those for $M = 2^{10}$, we see using a larger
table costs roughly 140 s in every case. This increase is small when considering
that $M$ was multiplied by $2^{10}$ and is consistent with the fact that the computational
cost is dominated by the hashing. Comparing different hashes, using a 10-wise
independent hash was about twice as slow as using a recursive hash. Hashing with
ID37 was 15-25% faster than using Cohen's approaches.

Assuming that we are willing to allocate very large files to create suffix arrays
and use much internal memory, an exact count is still at least 10 times more
expensive than an approximation. Whereas the suffix-array approach would take about
an hour to compute $n$-gram counts over the entire Gutenberg CD, an estimate can
be available in about 6 minutes while using very little memory and no permanent
storage.

6 Conclusion 

Considering speed, theoretical guarantees, and actual results, we recommend
Cohen's General. It is fast, has a theoretical performance guarantee, and behaves at
least as well as either ID37 or the $n$-wise independent approach. General is
pairwise independent so that there are minimal theoretical bounds to its performance.
The $n$-wise independent hashing comes with a stronger theoretical guarantee, and
thus there can be no unpleasant surprises with its accuracy on any data set. Yet
there is a significant speed penalty for its use in our implementation. The speed
gain of ID37 is worthwhile only for very small values of $M$. Not only does it lack a
theoretical accuracy guarantee, but for larger $M$ it is observed to fall far behind the
other hashing approaches in practice. Except where accuracy is far less important
than speed, we cannot recommend ID37.

Iceberg-count and entropy estimates on English text showed good decay with
larger values of $M$, warranting further theoretical investigations.

There are various avenues for follow-up work that we are pursuing. Further
improvements to the theoretical bound seem possible, especially for larger values
of $p$. A more sophisticated solution to the simultaneous estimation problem may
be possible, and it could be experimentally evaluated against the approach sketched
in Section 4.2 and against suffix-array methods.

Efficient frequent string mining has been recently proposed [FHK05] to find
frequent substrings in one string that are rare in another. However, suffix arrays
do not support fast search for the most frequent phrases containing a given word.
While suffix arrays allow us to count occurrences of a substring, they provide no
means to count the occurrences of the various $n$-grams beginning with a given
substring, let alone the $n$-grams containing a substring.

A Pseudorandom-Number Generators and Accuracy 



The following graphs show that varying the source of random (or pseudorandom) 
numbers had little effect on General, Cyclic or the n-wise random hashes. 



[Plot omitted. Horizontal axis: M. Series: rand(), Mersenne Twister, Atmospheric 
Noise, Marsaglia/Zaman, pairwise bound, 5-wise bound.] 



Figure 8: CYCLIC is not affected much by the source of random numbers. 
Figure 9: General is similarly unaffected. 
Figure 10: The n-wise independent hash is similarly unaffected. 
(a) Count estimate errors over Shakespeare's First 
Folio (00ws110), 100 runs estimating 10-grams 
with M = 2048. 
Figure 2: Average relative error ε after 100 runs and over four hash functions. 



[Plot omitted. Horizontal axis: trial rank. Series: ID37, general polynomial, 
cyclic polynomial, n-wise independent.] 

Figure 3: Errors on remus10 ("Uncle Remus His Songs and His Sayings"), from 
100 runs estimating 10-grams with M = 256. 
Figure 4: 95th-percentile error values (for 5-gram estimates) on 00ws1 for various 
hash families, over a wide range of M. Our analysis does not permit prediction of 
error bounds when M = 16. 







[Plots omitted (four panels). Series in each panel: ID37, general polynomial, 
cyclic polynomial, n-wise independent, pairwise bound, 5-wise bound.] 



Figure 5: n=5 Zipfian data, s = 1 (top left), s = 1.2 (top right), s = 1.6 (bottom left) 
and s = 2 (bottom right). 




Figure 6: Relative errors on smaller data sets: rlchn, remus, utopi, esymn and hioaj. 
Points with zero error are omitted, due to the logarithmic axes. 
(a) Entropy estimates 
(b) Iceberg-count estimates 



Figure 7: Relative errors on larger data sets: 00ws1, eduha, heath, wflsh, btowe 
and utrkj. 



